A re-examination of text categorization methods
Yiming Yang, Xin Liu 

Proceedings of the 22nd annual international 
ACM SIGIR conference on Research and 
development in information retrieval 



August 1999 



Publisher: ACM Press

Special issue on special feature: Distributional word clusters vs. words for text categorization
Ron Bekkerman, Ran El-Yaniv, Naftali Tishby, Yoad Winter 
March 2003
The Journal of Machine Learning Research, volume 3

Publisher: MIT Press

We study an approach to text categorization that 
combines distributional clustering of words and a 
Support Vector Machine (SVM) classifier. This word- 
cluster representation is computed using the recently 
introduced Information Bottleneck method, which
generates a compact and efficient representation of
documents. When combined with the classification 
power of the SVM, this method yields high performance 
in text categorization. This novel combination of SVM 
with word-cluster representation ... 

Text categorization: A repetition based measure for verification of text collections and for text categorization

Dmitry V. Khmelev, William J. Teahan 

Proceedings of the 26th annual international 
ACM SIGIR conference on Research and 
development in information retrieval

Publisher: ACM Press

We suggest a way for locating duplicates and 
plagiarisms in a text collection using an R-measure, 
which is the normalized sum of the lengths of all suffixes 
of the text repeated in other documents of the 
collection. The R-measure can be effectively computed 
using the suffix array data structure. Additionally, the 
computation procedure can be improved to locate the 
sets of duplicate or plagiarised documents. We applied 
the technique to several standard text collections and 
found that they ... 

Keywords: cross-entropy, language modeling, text 
categorization, text compression 



Meaningful term extraction and discriminative term 
selection in text categorization via unknown-word 
methodology 

Yu-Sheng Lai, Chung-Hsien Wu 

ACM Transactions on Asian Language Information Processing (TALIP), Volume 1 Issue 1

Publisher: ACM Press



In this article, an approach based on unknown words is 
proposed for meaningful term extraction and 
discriminative term selection in text categorization. For 
meaningful term extraction, a phrase-like unit (PLU)- 
based likelihood ratio is proposed to estimate the
likelihood that a word sequence is an unknown word. On
the other hand, a discriminative measure is proposed for 
term selection and is combined with the PLU-based 
likelihood ratio to determine the text category. We 
conducted several experim ... 

Keywords: AC-machine, dimensionality reduction, 
discriminability, discriminative term selection, 
inconsistency problem, meaningful term extraction, n- 
gram, phrase-like unit, sparse data problem, term 
adaptation, term purification, text categorization, text 
indexing, unknown word detection, vector space 
modeling 



Text categorization: A scalability analysis of classifiers in text categorization
Yiming Yang, Jian Zhang, Bryan Kisiel 

Proceedings of the 26th annual international 
ACM SIGIR conference on Research and 
development in information retrieval

Publisher: ACM Press

Real-world applications of text categorization often 
require a system to deal with tens of thousands of 
categories defined over a large taxonomy. This paper 
addresses the problem with respect to a set of popular 
algorithms in text categorization, including Support 
Vector Machines, k-nearest neighbor, ridge regression, 
linear least square fit and logistic regression. By 
providing a formal analysis of the computational 
complexity of each classification method, followed by an 
investigation on the u ... 

Keywords: complexity analysis, hierarchical text 
categorization, power law 



Feature selection, perceptron learning, and a usability case study for text categorization

Hwee Tou Ng, Wei Boon Goh, Kok Leong Low 

ACM SIGIR Forum, Proceedings of the 20th
annual international ACM SIGIR conference on
Research and development in information
retrieval SIGIR '97, Volume 31 Issue SI



Publisher: ACM Press

Special issue on special feature: A divisive information theoretic feature clustering algorithm for text classification

Inderjit S. Dhillon, Subramanyam Mallela, Rahul Kumar
March 2003
The Journal of Machine Learning Research, volume

Publisher: MIT Press 

High dimensionality of text can be a deterrent in 
applying complex learners such as Support Vector 
Machines to the task of text classification. Feature 
clustering is a powerful alternative to feature selection 
for reducing the dimensionality of text data. In this 
paper we propose a new information-theoretic divisive 
algorithm for feature/word clustering and apply it to text 
classification. Existing techniques for such "distributional 
clustering" of words are agglomerative in nature and 
result in ... 



Session: Exploring the use of linguistic features in domain and genre classification

Maria Wolters, Mathias Kirsten 

Proceedings of the ninth conference on 
European chapter of the Association for 
Computational Linguistics 

Publisher: Association for Computational Linguistics




The central questions are: How useful is information 
about part-of-speech frequency for text categorisation? 
Is it feasible to limit word features to content words for 
text classifications? This is examined for 5 domain and 4 
genre classification tasks using LIMAS, the German 
equivalent of the Brown corpus. Because LIMAS is too 
heterogeneous, neither question can be answered 
reliably for any of the tasks. However, the results 
suggest that both questions have to be examined 
separately for each ta ...

Text categorization for multiple users based on semantic features from a machine-readable dictionary
Elizabeth D. Liddy, Woojin Paik, Edmund S. Yu 

July 1994
ACM Transactions on Information Systems (TOIS), Volume 12 Issue 3

Publisher: ACM Press



The text categorization module described here provides 
a front-end filtering function for the larger DR-LINK text 
retrieval system [Liddy and Myaeng 1993]. The model
evaluates a large incoming stream of documents to 
determine which documents are sufficiently similar to a 
profile at the broad subject level to warrant more 
refined representation and matching. To accomplish this 
task, each substantive word in a text is first categorized 
using a feature set based on the semantic Subject 
Field ... 

Keywords: semantic vectors, subject field coding 



Fast supervised dimensionality reduction algorithm with applications to document categorization & retrieval

George Karypis, Eui-Hong (Sam) Han 

November 2000
Proceedings of the ninth international conference on Information and knowledge management

Publisher: ACM Press



Special section on data mining for intrusion detection and threat analysis: Mining e-mail content for author identification forensics

O. de Vel, A. Anderson, M. Corney, G. Mohay 

ACM SIGMOD Record, Volume 30 Issue 4

Publisher: ACM Press

We describe an investigation into e-mail content mining 
for author identification, or authorship attribution, for 
the purpose of forensic investigation. We focus our 
discussion on the ability to discriminate between authors
for the case of both aggregated e-mail topics as well as
across different e-mail topics. An extended set of e-mail 
document features including structural characteristics 
and linguistic patterns were derived and, together with a 
Support Vector Machine learning algorithm, were ... 

Classification: Boosting to correct inductive bias in text classification

Yan Liu, Yiming Yang, Jaime Carbonell 

2002 Proceedings of the eleventh international 
conference on Information and knowledge 
management 

Publisher: ACM Press

This paper studies the effects of boosting in the context 
of different classification methods for text 
categorization, including Decision Trees, Naive Bayes, 
Support Vector Machines (SVMs) and a Rocchio-style 
classifier. We identify the inductive biases of each 
classifier and explore how boosting, as an error-driven 
resampling mechanism, reacts to those biases. Our 
experiments on the Reuters-21578 benchmark show 
that boosting is not effective in improving the 
performance of the base classifiers ... 

Keywords: boosting, inductive bias, machine learning, 
text classification 



Fast and accurate text classification via multiple linear discriminant projections

Soumen Chakrabarti, Shourya Roy, Mahesh V. Soundalgekar

The VLDB Journal - The International Journal on Very Large Data Bases, Volume 12 Issue 2

Publisher: Springer-Verlag New York, Inc.

Abstract. Support vector machines (SVMs) have shown
superb performance for text classification tasks. They 
are accurate, robust, and quick to apply to test 
instances. Their only potential drawback is their training 
time and memory requirement. For n training instances 
held in memory, the best-known SVM implementations
take time proportional to n^a, where a is typically
between 1.8 and 2.1. SVMs have been trained on data
sets with several thousand instances, but Web direct ... 

Keywords: Discriminative learning, Linear 
discriminants, Text classification 



Text classification using string kernels
Huma Lodhi, Craig Saunders, John Shawe-Taylor, Nello Cristianini, Chris Watkins

The Journal of Machine Learning Research, volume 2

Publisher: MIT Press

We propose a novel approach for categorizing text 
documents based on the use of a special kernel. The 
kernel is an inner product in the feature space 
generated by all subsequences of length k.
A subsequence is any ordered sequence of
k characters occurring in the text though
not necessarily contiguously. The subsequences are 
weighted by an exponentially decaying factor of their full 
length in the text, hence emphasising those occurrences 
that are close t ... 

Keywords: approximating kernels, kernels and support 
vector machines, string subsequence kernel, text 
classification 



An example-based mapping method for text 
categorization and retrieval 
Yiming Yang, Christopher G. Chute 

ACM Transactions on Information Systems (TOIS), Volume 12 Issue 3

Publisher: ACM Press

A unified model for text categorization and text retrieval 
is introduced. We use a training set of manually 
categorized documents to learn word-category 
associations, and use these associations to predict the 
categories of arbitrary documents. Similarly, we use a 
training set of queries and their related documents to
obtain empirical associations between query words and
indexing terms of documents, and use these 
associations to predict the related documents of 
arbitrary queries. A Linear Le ... 

Keywords: document categorization, query 
categorization, statistical learning of human decisions 



Posters: Machine learning methods for Chinese web page categorization

Ji He, Ah-Hwee Tan, Chew-Lim Tan 

October 2000

Proceedings of the second workshop on 
Chinese language processing: held in 
conjunction with the 38th Annual Meeting of 
the Association for Computational Linguistics - 
Volume 12 

Publisher: Association for Computational Linguistics

This paper reports our evaluation of k Nearest Neighbor 
(kNN), Support Vector Machines (SVM), and Adaptive 
Resonance Associative Map (ARAM) on Chinese web 
page classification. Benchmark experiments based on a 
Chinese web corpus showed that their predictive 
performance was roughly comparable, although ARAM
and kNN slightly outperformed SVM in small categories. 
In addition, inserting rules into ARAM helped to improve 
performance, especially for small well-defined 
categories. 

Machine learning in automated text categorization 
Fabrizio Sebastiani 

ACM Computing Surveys (CSUR), Volume 34 Issue 1

Publisher: ACM Press



The automated categorization (or classification) of texts 
into predefined categories has witnessed a booming 
interest in the last 10 years, due to the increased 
availability of documents in digital form and the ensuing 
need to organize them. In the research community the 
dominant approach to this problem is based on machine 
learning techniques: a general inductive process 
automatically builds a classifier by learning, from a set
of preclassified documents, the characteristics of the
categories. ... 

Keywords: Machine learning, text categorization, text 
classification 



Hierarchical classification of Web content 
Susan Dumais, Hao Chen 

Proceedings of the 23rd annual international 
ACM SIGIR conference on Research and 
development in information retrieval 



July 2000 



Publisher: ACM Press



This paper explores the use of hierarchical 
structure for classifying a large, 
heterogeneous collection of web content. 
The hierarchical structure is initially used 
to train different second-level classifiers. 
In the hierarchical case, a model is 
learned to distinguish a second-level 
category from other categories within the 
same top level. In the flat non-hierarchical 
case, a model distinguishes a second-level 
category from all other second-level 
categories. Scoring rules can further take 
ad ... 

Keywords: Web hierarchies,
classification, hierarchical models,
machine learning, support vector
machines, text categorization, text
classification



http://portal.acm.org/results.cfin?CFID=71643739&CFTOKEN=30... 5/18/06

Results (page 1): +text +categorization +computer +vector Page 10 of 11

Special issue on kernel methods: One-class SVMs for document classification
Larry M. Manevitz, Malik Yousef 



March 2002
The Journal of Machine Learning Research, volume 2

Publisher: MIT Press 





We implemented versions of the SVM appropriate for 
one-class classification in the context of information 
retrieval. The experiments were conducted on the 
standard Reuters data set. For the SVM implementation 
we used both a version of Schoelkopf et al. and a 
somewhat different version of one-class SVM based on 
identifying "outlier" data as representative of the 
second-class. We report on experiments with different 
kernels for both of these implementations and with 
different represe ...

Text classification: Enhanced word clustering for hierarchical text classification

Inderjit S. Dhillon, Subramanyam Mallela, Rahul Kumar 
Proceedings of the eighth ACM SIGKDD
international conference on Knowledge 
discovery and data mining 

Publisher: ACM Press 

In this paper we propose a new information-theoretic 
divisive algorithm for word clustering applied to text 
classification. In previous work, such "distributional 
clustering" of features has been found to achieve 
improvements over feature selection in terms of 
classification accuracy, especially at lower number of 
features [2, 28]. However the existing clustering 
techniques are agglomerative in nature and result in (i) 
sub-optimal word clusters and (ii) high computational 
cost. In order to expli ...